MAB algorithm
A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms
Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
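To make the comparison concrete, here is a minimal sketch (in the spirit of, but not taken from, the Bandit Playground codebase) contrasting classical UCB1 with a UCB-V-style rule whose exploration bonus uses the empirical variance; the arm means mimic the subtly separated regime described above, and all constants are illustrative.

```python
# Minimal sketch: UCB1 vs. a variance-aware UCB-V-style rule on Bernoulli arms.
# Constants and arm means are illustrative, not taken from the paper.
import numpy as np

def run_ucb(means, horizon, variance_aware=False, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    sq_sums = np.zeros(k)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            mu = sums / counts
            if variance_aware:
                var = np.maximum(sq_sums / counts - mu ** 2, 0.0)
                bonus = np.sqrt(2 * var * np.log(t) / counts) + 3 * np.log(t) / counts
            else:
                bonus = np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(mu + bonus))
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward ** 2
        regret += max(means) - means[arm]
    return regret

means = [0.50, 0.52, 0.48]  # subtle differences between arm rewards
print("UCB1 regret :", run_ucb(means, 20000))
print("UCB-V regret:", run_ucb(means, 20000, variance_aware=True))
```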
Reinforcement Learning for Search Tree Size Minimization in Constraint Programming: New Results on Scheduling Benchmarks
Heinz, Vilém, Vilím, Petr, Hanzálek, Zdeněk
Failure-Directed Search (FDS) is a complete, generic search algorithm used in Constraint Programming (CP) to explore the search space efficiently; it has proven particularly effective on scheduling problems. This paper analyzes FDS's properties and shows that minimizing the size of its search tree, guided by ranked branching decisions, is closely related to the multi-armed bandit (MAB) problem. Building on this insight, MAB reinforcement learning algorithms are applied to FDS, extended with problem-specific refinements and parameter tuning, and evaluated on the two most fundamental scheduling problems: the Job Shop Scheduling Problem (JSSP) and the Resource-Constrained Project Scheduling Problem (RCPSP). The resulting enhanced FDS, using the best extended MAB algorithm and configuration, runs 1.7 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks than the original implementation in a new solver called OptalCP, and 3.5 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks than the current state-of-the-art FDS algorithm in IBM CP Optimizer 22.1. Furthermore, using only a 900-second time limit per instance, the enhanced FDS improved the existing state-of-the-art lower bounds of 78 of 84 JSSP and 226 of 393 RCPSP standard open benchmark instances, while also completely closing a few of them.
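To illustrate the connection, the sketch below treats candidate branching decisions as bandit arms and rates them with a UCB rule; the decision set and the "fails fast" reward signal are hypothetical stand-ins, not FDS's actual rating scheme or the paper's refinements.

```python
# Hedged sketch: branching decisions as bandit arms, rated by a UCB rule.
# The decision set and the fail-fast reward probabilities are hypothetical.
import math
import random

class BranchingBandit:
    def __init__(self, decisions):
        self.counts = {d: 0 for d in decisions}
        self.values = {d: 0.0 for d in decisions}
        self.t = 0

    def select(self):
        self.t += 1
        def ucb(d):
            if self.counts[d] == 0:
                return float("inf")  # try every decision at least once
            return self.values[d] + math.sqrt(2 * math.log(self.t) / self.counts[d])
        return max(self.counts, key=ucb)

    def update(self, decision, reward):
        self.counts[decision] += 1
        # incremental mean of observed rewards
        self.values[decision] += (reward - self.values[decision]) / self.counts[decision]

random.seed(0)
fail_fast_prob = {"x<5": 0.7, "x>=5": 0.2, "y<3": 0.5, "y>=3": 0.3}
bandit = BranchingBandit(list(fail_fast_prob))
for _ in range(500):
    d = bandit.select()
    reward = 1.0 if random.random() < fail_fast_prob[d] else 0.0
    bandit.update(d, reward)
print(sorted(bandit.counts.items(), key=lambda kv: -kv[1]))
```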
Adversarial Bandit over Bandits: Hierarchical Bandits for Online Configuration Management
Avin, Chen, Lotker, Zvi, Mannor, Shie, Shabat, Gil, Shteingart, Hanan, Yadgar, Roey
Motivated by dynamic parameter optimization in finite but large action (configuration) spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB, a hierarchical Adversarial Bandit over Bandits algorithm that can use state-of-the-art existing "flat" algorithms but additionally clusters similar configurations to exploit local structure and adapt to changing environments. We prove that in the worst case such a clustering approach cannot hurt much: ABoB guarantees a standard worst-case regret bound of $O\left(k^{\frac{1}{2}}T^{\frac{1}{2}}\right)$, where $T$ is the number of rounds and $k$ is the number of arms, matching the traditional flat approach. However, under favorable conditions on the algorithm, the clusters, and certain Lipschitz properties, the regret bound improves to $O\left(k^{\frac{1}{4}}T^{\frac{1}{2}}\right)$. Simulations and experiments on a real storage system demonstrate that ABoB, using standard algorithms such as EXP3 and Tsallis-INF, achieves lower regret and faster convergence than the flat method, with up to a 50% improvement in previously studied setups, both nonstochastic and stochastic, as well as in our own settings.
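The two-level structure is simple to sketch: a top-level EXP3 picks a cluster, and a per-cluster EXP3 picks a configuration inside it. The clusters below are hard-coded for illustration; ABoB's clustering and regret analysis are in the paper.

```python
# Minimal sketch of a bandit-over-bandits: EXP3 over clusters, EXP3 within each.
import math
import random

class EXP3:
    def __init__(self, k, gamma=0.1):
        self.k, self.gamma = k, gamma
        self.w = [1.0] * k

    def probs(self):
        total = sum(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.k for wi in self.w]

    def draw(self):
        return random.choices(range(self.k), weights=self.probs())[0]

    def update(self, i, reward):
        p = self.probs()[i]
        self.w[i] *= math.exp(self.gamma * reward / (p * self.k))
        m = max(self.w)
        self.w = [wi / m for wi in self.w]  # renormalize to avoid overflow

random.seed(1)
clusters = [[0.20, 0.25], [0.80, 0.75], [0.50, 0.55]]  # mean reward per configuration
top = EXP3(len(clusters))
inner = [EXP3(len(c)) for c in clusters]
for _ in range(5000):
    ci = top.draw()        # choose a cluster
    ai = inner[ci].draw()  # choose a configuration within it
    r = float(random.random() < clusters[ci][ai])
    inner[ci].update(ai, r)
    top.update(ci, r)
print("top-level cluster probabilities:", [round(p, 2) for p in top.probs()])
```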
Large Language Model-Enhanced Multi-Armed Bandits
Sun, Jiahang, Wang, Zhiyong, Yang, Runhan, Xiao, Chenjun, Lui, John C. S., Dai, Zhongxiang
Large language models (LLMs) have been adopted to solve sequential decision-making tasks such as multi-armed bandits (MAB), in which an LLM is directly instructed to select the arms to pull in every iteration. However, this paradigm of direct arm selection using LLMs has been shown to be suboptimal in many MAB tasks. Therefore, we propose an alternative approach that combines the strengths of classical MAB algorithms and LLMs. Specifically, we adopt a classical MAB algorithm as the high-level framework and leverage the strong in-context learning capability of LLMs to perform the sub-task of reward prediction. First, we incorporate the LLM-based reward predictor into the classical Thompson sampling (TS) algorithm and adopt a decaying schedule for the LLM temperature to ensure a transition from exploration to exploitation. Next, we incorporate the LLM-based reward predictor (with a temperature of 0) into a regression oracle-based MAB algorithm equipped with an explicit exploration mechanism. We also extend our TS-based algorithm to dueling bandits, where only preference feedback between pairs of arms is available, which requires non-trivial algorithmic modifications. We conduct empirical evaluations using both synthetic MAB tasks and experiments designed around real-world text datasets; the results show that our algorithms consistently outperform previous baseline methods based on direct arm selection. Interestingly, we also demonstrate that in challenging tasks where the arms lack semantic meanings that the LLM could exploit, our approach achieves considerably better performance than LLM-based direct arm selection.
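A hedged sketch of the TS-based variant follows: `llm_predict_reward` is a hypothetical stub standing in for the LLM's in-context reward prediction, with additive noise scaled by a decaying `temperature` playing the role of the LLM temperature schedule.

```python
# Hedged sketch: Thompson-sampling-style loop with a stubbed LLM reward predictor.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.6, 0.5])

def llm_predict_reward(history, arm, temperature):
    # Stub: noisy estimate from the arm's empirical mean. In the paper this
    # prediction comes from an LLM conditioned on the interaction history.
    pulls = [r for a, r in history if a == arm]
    mean = float(np.mean(pulls)) if pulls else 0.5
    return mean + temperature * rng.normal()

history = []
for t in range(1, 501):
    temperature = 1.0 / np.sqrt(t)  # decaying schedule: explore early, exploit late
    scores = [llm_predict_reward(history, a, temperature) for a in range(len(true_means))]
    arm = int(np.argmax(scores))
    reward = float(rng.random() < true_means[arm])
    history.append((arm, reward))
print("pulls per arm:", np.bincount([a for a, _ in history], minlength=len(true_means)))
```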
Minimizing Queue Length Regret for Arbitrarily Varying Channels
Krishnakumar, G, Sinha, Abhishek
We consider an online channel scheduling problem for a single transmitter-receiver pair equipped with $N$ arbitrarily varying wireless channels. The transmission rates of the channels might be non-stationary and could be controlled by an oblivious adversary. At every slot, incoming data arrives at an infinite-capacity data queue located at the transmitter. A scheduler, which is oblivious to the current channel rates, selects one of the $N$ channels for transmission. At the end of the slot, the scheduler only gets to know the transmission rate of the selected channel. The objective is to minimize the queue length regret, defined as the difference between the queue length at some time $T$ achieved by an online policy and the queue length obtained by always transmitting over the single best channel in hindsight. We propose a weakly adaptive Multi-Armed Bandit (MAB) algorithm for minimizing the queue length regret in this setup. Unlike previous works, we do not make any stability assumptions about the queue or the arrival process. Hence, our result holds even when the queueing process is unstable. Our main observation is that the queue length regret can be upper bounded by the regret of a MAB policy that competes against the best channel in hindsight uniformly over all sub-intervals of $[T]$. As a technical contribution of independent interest, we then propose a weakly adaptive adversarial MAB policy which achieves $\tilde{O}(\sqrt{N}T^{\frac{3}{4}})$ regret with high probability, implying the same bound for queue length regret.
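The setting is easy to simulate. In the sketch below, plain EXP3 stands in for the paper's weakly adaptive policy (which additionally competes over all sub-intervals of $[T]$), and a time-varying rate function stands in for the oblivious adversary.

```python
# Minimal sketch: arrivals join a queue, a bandit scheduler picks one of N
# channels per slot, and only the chosen channel's rate is observed.
import math
import random

random.seed(0)
N, T = 3, 5000

def rates(t):  # oblivious, time-varying channel rates (illustrative)
    return [1.0, 0.5 + 0.5 * math.sin(t / 500), 0.3]

arrivals = [random.random() for _ in range(T)]

w, gamma, queue = [1.0] * N, 0.05, 0.0
for t in range(T):
    total = sum(w)
    probs = [(1 - gamma) * wi / total + gamma / N for wi in w]
    ch = random.choices(range(N), weights=probs)[0]
    r = rates(t)[ch]  # bandit feedback: chosen channel only
    w[ch] *= math.exp(gamma * r / (probs[ch] * N))
    queue = max(queue + arrivals[t] - r, 0.0)

def fixed_queue(i):  # queue length when always transmitting over channel i
    q = 0.0
    for t in range(T):
        q = max(q + arrivals[t] - rates(t)[i], 0.0)
    return q

best = min(fixed_queue(i) for i in range(N))  # best channel in hindsight
print(f"queue length regret at T: {queue - best:.1f}")
```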
Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
Donâncio, Henrique, Barrier, Antoine, South, Leah F., Forbes, Florence
Reinforcement Learning (RL), when combined with function approximators such as Artificial Neural Networks (ANNs), has shown success in learning policies that outperform humans in complex games by leveraging extensive datasets (see, e.g., 33, 19, 39, 40). While ANNs were previously used as value function approximators [29], the introduction of Deep Q-Networks (DQN) by [24, 25] marked a significant breakthrough by improving learning stability through two mechanisms: the target network and experience replay. The experience replay (see 22) stores the agent's interactions within the environment, allowing sampling of past interactions in a random way that disrupts their correlation. The target network further stabilizes the learning process by periodically copying the parameters of the learning network. This strategy is crucial because the Bellman update --using estimations to update other estimations-- would otherwise occur using the same network, potentially causing divergence. By leveraging the target network, gradient steps are directed towards a periodically fixed target, ensuring more stability in the learning process. Additionally, the learning rate hyperparameter controls the magnitude of these gradient steps in optimizers such as the stochastic gradient descent algorithm, affecting the training convergence. The learning rate is one of the most important hyperparameters, with previous work demonstrating that decreasing its value during policy finetuning can enhance performance by up to 25% in vanilla DQN [3].
Reviews: Bandit Learning with Positive Externalities
The paper studies the interesting problem of learning with externalities in a multi-armed bandit (MAB) setting. The main idea is that there may be a bias in the preferences of users arriving at online platforms. Specifically, future users are likely to have preferences similar to those of users who previously accessed the same platform and were satisfied with the service. Since some online platforms use MAB algorithms to optimize their service, the authors propose the Balanced Exploration (BE) MAB algorithm, which has a structured exploration strategy that takes this potential "future user preference bias" (referred to as "positive externalities") into account. The bias in user preferences translates directly into user-specific reward values: of the m possible items/arms, each user prefers a subset of them (the reward for a preferred arm is a Bernoulli reward with mean proportional to the arm's popularity), while the rewards of all other arms are always zero.
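The reward model is concrete enough to simulate. The sketch below pairs it with a uniform-random placeholder policy and an illustrative popularity update; it is not the paper's Balanced Exploration algorithm, only the externality dynamic BE is designed for.

```python
# Hedged sketch of the reward model: each arriving user prefers a subset of
# arms with probability tied to accumulated popularity; a preferred arm pays a
# Bernoulli reward, all other arms pay zero.
import random

random.seed(0)
m = 3
base_quality = [0.6, 0.5, 0.4]  # hypothetical per-arm success probabilities
popularity = [1.0] * m          # past successes bias future users' preferences

def user_preferences():
    total = sum(popularity)
    return {a for a in range(m) if random.random() < popularity[a] / total}

for _ in range(2000):
    prefs = user_preferences()
    arm = random.randrange(m)   # placeholder policy; BE structures this exploration
    if arm in prefs and random.random() < base_quality[arm]:
        popularity[arm] += 1    # positive externality: success attracts similar users
print("final popularity:", popularity)
```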
A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control
Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, Martin J. Wainwright
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g., when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages against various alternatives over time. Our framework replaces a sequence of A/B tests with a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. By interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we obtain the best of both worlds: low sample complexity and anytime online FDR control. Our main contributions are: (i) proposing reasonable definitions of a null hypothesis for MAB instances; (ii) demonstrating how to derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) showing that using the rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms yields sample optimality, high power, and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
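The interleaving is easy to sketch: an online FDR procedure allocates a rejection threshold alpha_j to each experiment, which doubles as the confidence level of a best-arm test. Both the alpha schedule (a simplified alpha-investing rule) and the crude uniform-sampling best-arm test below are stand-ins for the paper's actual procedures.

```python
# Minimal sketch: online FDR thresholds reused as confidence levels for
# best-arm tests. The alpha schedule and the test itself are simplified.
import numpy as np

rng = np.random.default_rng(0)

def best_arm_test(means, alpha, budget=4000):
    """Reject the null ("no arm beats the control, arm 0") if some arm's
    (1 - alpha) lower confidence bound clears the control's upper bound."""
    k = len(means)
    pulls = rng.random((budget // k, k)) < np.array(means)
    mu = pulls.mean(axis=0)
    width = np.sqrt(np.log(2 * k / alpha) / (2 * pulls.shape[0]))  # Hoeffding
    return bool((mu[1:] - width > mu[0] + width).any())

wealth, alpha0, rejections = 0.05, 0.05, 0
experiments = [[0.5, 0.5], [0.5, 0.65], [0.5, 0.5], [0.5, 0.6]]  # arm 0 = control
for j, means in enumerate(experiments, start=1):
    alpha_j = min(wealth / 2, alpha0)  # spend half the remaining wealth
    wealth -= alpha_j
    if best_arm_test(means, alpha_j):
        rejections += 1
        wealth += alpha0  # a discovery earns wealth back (alpha-investing style)
    print(f"experiment {j}: alpha={alpha_j:.4f}, total rejections={rejections}")
```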